8 research outputs found
Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text
Automatic Arabic diacritization is useful in many applications, ranging from
reading support for language learners to accurate pronunciation predictor for
downstream tasks like speech synthesis. While most of the previous works
focused on models that operate on raw non-diacritized text, production systems
can gain accuracy by first letting humans partly annotate ambiguous words. In
this paper, we propose 2SDiac, a multi-source model that can effectively
support optional diacritics in input to inform all predictions. We also
introduce Guided Learning, a training scheme to leverage given diacritics in
input with different levels of random masking. We show that the provided hints
during test affect more output positions than those annotated. Moreover,
experiments on two common benchmarks show that our approach i) greatly
outperforms the baseline also when evaluated on non-diacritized text; and ii)
achieves state-of-the-art results while reducing the parameter count by over
60%.Comment: Arabic text diacritization, partially-diacritized text, Arabic
natural language processin
Improving Language Model Integration for Neural Machine Translation
The integration of language models for neural machine translation has been
extensively studied in the past. It has been shown that an external language
model, trained on additional target-side monolingual data, can help improve
translation quality. However, there has always been the assumption that the
translation model also learns an implicit target-side language model during
training, which interferes with the external language model at decoding time.
Recently, some works on automatic speech recognition have demonstrated that, if
the implicit language model is neutralized in decoding, further improvements
can be gained when integrating an external language model. In this work, we
transfer this concept to the task of machine translation and compare with the
most prominent way of including additional monolingual data - namely
back-translation. We find that accounting for the implicit language model
significantly boosts the performance of language model fusion, although this
approach is still outperformed by back-translation.Comment: accepted at ACL2023 (Findings
Analyzing And Improving Neural Speaker Embeddings for ASR
Neural speaker embeddings encode the speaker's speech characteristics through
a DNN model and are prevalent for speaker verification tasks. However, few
studies have investigated the usage of neural speaker embeddings for an ASR
system. In this work, we present our efforts w.r.t integrating neural speaker
embeddings into a conformer based hybrid HMM ASR system. For ASR, our improved
embedding extraction pipeline in combination with the Weighted-Simple-Add
integration method results in x-vector and c-vector reaching on par performance
with i-vectors. We further compare and analyze different speaker embeddings. We
present our acoustic model improvements obtained by switching from newbob
learning rate schedule to one cycle learning schedule resulting in a ~3%
relative WER reduction on Switchboard, additionally reducing the overall
training time by 17%. By further adding neural speaker embeddings, we gain
additional ~3% relative WER improvement on Hub5'00. Our best Conformer-based
hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and
Hub5'01 with training on SWB 300h.Comment: Accepted at ITG Speech Communications 202
Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition
Subword units are commonly used for end-to-end automatic speech recognition
(ASR), while a fully acoustic-oriented subword modeling approach is somewhat
missing. We propose an acoustic data-driven subword modeling (ADSM) approach
that adapts the advantages of several text-based and acoustic-based subword
methods into one pipeline. With a fully acoustic-oriented label design and
learning process, ADSM produces acoustic-structured subword units and
acoustic-matched target sequence for further ASR training. The obtained ADSM
labels are evaluated with different end-to-end ASR approaches including CTC,
RNN-Transducer and attention models. Experiments on the LibriSpeech corpus show
that ADSM clearly outperforms both byte pair encoding (BPE) and
pronunciation-assisted subword modeling (PASM) in all cases. Detailed analysis
shows that ADSM achieves acoustically more logical word segmentation and more
balanced sequence length, and thus, is suitable for both time-synchronous and
label-synchronous models. We also briefly describe how to apply acoustic-based
subword regularization and unseen text segmentation using ADSM.Comment: accepted at Interspeech202
Enhancing and Adversarial: Improve ASR with Speaker Labels
ASR can be improved by multi-task learning (MTL) with domain enhancing or
domain adversarial training, which are two opposite objectives with the aim to
increase/decrease domain variance towards domain-aware/agnostic ASR,
respectively. In this work, we study how to best apply these two opposite
objectives with speaker labels to improve conformer-based ASR. We also propose
a novel adaptive gradient reversal layer for stable and effective adversarial
training without tuning effort. Detailed analysis and experimental verification
are conducted to show the optimal positions in the ASR neural network (NN) to
apply speaker enhancing and adversarial training. We also explore their
combination for further improvement, achieving the same performance as
i-vectors plus adversarial training. Our best speaker-based MTL achieves 7\%
relative improvement on the Switchboard Hub5'00 set. We also investigate the
effect of such speaker-based MTL w.r.t. cleaner dataset and weaker ASR NN.Comment: accepted at ICASSP 202
Development of Hybrid ASR Systems for Low Resource Medical Domain Conversational Telephone Speech
Language barriers present a great challenge in our increasingly connected and
global world. Especially within the medical domain, e.g. hospital or emergency
room, communication difficulties and delays may lead to malpractice and
non-optimal patient care. In the HYKIST project, we consider patient-physician
communication, more specifically between a German-speaking physician and an
Arabic- or Vietnamese-speaking patient. Currently, a doctor can call the
Triaphon service to get assistance from an interpreter in order to help
facilitate communication. The HYKIST goal is to support the usually
non-professional bilingual interpreter with an automatic speech translation
system to improve patient care and help overcome language barriers. In this
work, we present our ASR system development efforts for this conversational
telephone speech translation task in the medical domain for two languages
pairs, data collection, various acoustic model architectures and
dialect-induced difficulties.Comment: ASR System Paper for HYKIST projec